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Abstract 

Gaussian elimination with no pivoting and block Gaussian elimination are attractive alterna¬ 
tives to the customary but communication intensive Gaussian elimination with partial pivoting 1 
provided that the computations proceed safely and numerically safely, that is, run into neither 
division by 0 nor numerical problems. Empirically, safety and numerical safety of GENP have 
been consistently observed in a number of papers where an input matrix was pre-processed with 
various structured multipliers chosen ad hoc. Our present paper provides missing formal sup¬ 
port for this empirical observation and explains why it was elusive so far. Namely we prove that 
GENP is numerically unsafe for a specific class of input matrices in spite of its pre-processing 
with some well-known and well-tested structured multipliers, but we also prove that GENP 
and BGE are safe and numerically safe for the average input matrix pre-processed with any 
nonsingular and well-conditioned multiplier. This should embolden search for sparse and struc¬ 
tured multipliers, and we list and test some new classes of them. We also seek randomized 
pre-processing that universally (that is, for all input matrices) supports (i) safe GENP and 
BGE with probability 1 and/or (ii) numerically safe GENP and BGE with a probability close 
to 1. We achieve goal (i) with a Gaussian structured multiplier and goal (ii) with a Gaussian 
unstructured multiplier and alternatively with Gaussian structured augmentation. We consis¬ 
tently confirm all these formal results with our tests of GENP for benchmark inputs. We have 
extended our approach to other fundamental matrix computations and keep working on further 
extensions. 

2000 Math. Subject Classification: 15A06, 15A52, 15A12, 65F05, 65F35 

Keywords: Gaussian elimination; Pivoting; Block Gaussian elimination; Preconditioning; Ran¬ 
dom matrix algorithms; Linear systems of equations 
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Computing (CASC’2014), September 10-14, 2015, Aachen, Germany (cf. [PZ15]). 

1 Hereafter we use the acronyms GENP, BGE, and GEPP. 
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1 Introduction 

1.1 Gaussian elimination with pivoting 

The history of Gaussian elimination can be traced back some 2000 years [Gil]. Its modern version, 
GEPP (with partial pivoting), is performed routinely, millions times per day around the world, 
being a cornerstone for computations in linear algebra [DDF14]. For an n x n matrix, elimination 
involves about |n 3 flops, and (n — l)n/2 comparison are required for partial pivoting, that is, 
row interchange. Clearly pivoting contributes a substantial share to the overall computational 
cost if n is small, but even for larger n communication intensive pivoting takes quite a heavy toll 
in modern computer environment. It interrupts the stream of arithmetic operations with foreign 
operations of comparison, involves book-keeping, compromises data locality, impedes parallelization 
of the computations, and increases communication overhead and data dependence. According to 
[BDHT13], “pivoting can represent more than 40% of the global factorization time for small matrices, 
and although the overhead decreases with the size of the matrix, it still represents 17% for a matrix 
of size 10,000”. Because of the heavy use of GEPP, even its limited improvement is valuable. 

1.2 Random and nonrandom multipliers. Safety and numerical safety 

Gaussian elimination with no pivoting (GENP) is an attractive alternative to GEPP 3 , but for some 
inputs can be unsafe or numerically unsafe , that is, can run into a division by 0 or numerical 
problems, respectively. Empirically, GENP is quite consistently safe and numerically safe [PQZ13], 
[BDHT13], [PQY15], [DDF14] if the input matrix is pre-processed with various structured multipliers 
chosen ad hoc (e.g., with random circulant or SRTF multipliers), 4 but formal support for this 
empirical observation turned out to be elusive. 

We explain why it is elusive by exhibiting a class of input matrices for which GENP with a random 
circulant or SRFT multiplier is always numerically unsafe (see Section 7.2), but we also explain why 
pre-processing with such multipliers usually works - we prove that GENP is both safe and numerically 
safe for the average input matrix pre-processed with any nonsingular and well-conditioned multiplier 
provided that the average is defined under the Gaussian probability distribution (see Section 4.2), 
which is a customary assumption in view of the Central Limit Theorem. 

In Section 6 we list and in Section 9 successfully test some promising classes of such sparse and 
structured multipliers, and plan to work further in this direction. For inputs to our tests we used 
benchmark matrices from [BDHT13] or applied the customary recipes of [H02, Section 28.3]. 

1.3 Universal pre-processing 

In addition to our results on pre-processing GENP with a fixed multiplier for the average input 
matrix, we prove that pre- as well as post-multiplication by a Gaussian multiplier 5 are universally 
safe, that is, safe with probability 1 for any fixed nonsingular input matrix, and is universally 
numerically safe, that is, likely to be numerically safe for any nonsingular and well-conditioned input 
matrix (see Section 4.2). Actually we prove that it is numerically safe with a higher probability if 
the input matrix is both pre- and post-multiplied by Gaussian matrices (see Remark 4.4). 

In computations with infinite precision and with no rounding errors (e.g., in Computer Algebra) 
one needs just safe (rather than numerically safe) GENP, and we prove universal safety of GENP 
even in the case of pre- or post-multiplication by a random Gaussian circulant matrix (see Section 
7). This result immediately enables 4-fold acceleration of the classical, extensively cited, and highly 
popular two-sided pre-processing of [KS91]. 

2 Here and hereafter “flop” stands for “floating point arithmetic operation”. 

3 Another alternative is symmetrization, but it has deficiencies for both numerical and symbolic computations: it 
squares the condition number of the input matrix and does not work over finite fields. 

4 SRFT is the acronym for Semisample Random Fourier Transform 

5 Here and hereafter “Gaussian matrices” and “Gaussian pre-processing” stand for “standard Gaussian random 
matrices” and “standard Gaussian random pre-processing”, respectively. 
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We cannot prove universal numerical safety of GENP pre-processed with SRFT or any other 
random structured multipliers, but we prove that GENP pre-processed with SRFT augmentation 
or SRFT additive preprocessing is universally safe with probability 1 and is likely to be universally 
numerically safe (see Section 8.2). 

Compared to [PQZ13], [BDHT13], [PQY15], and [DDF14], our present tests cover pre-processing 
also with multipliers from new classes as well as by means of augmentation; most important, now 
our test results are in very good accordance with our new formal results. 

1.4 Block Gaussian elimination 

Block matrix algorithms are highly important resource for enhancing the efficiency of matrix compu¬ 
tations [GL13], but block Gaussian elimination (BGE), including, e.g., the MBA celebrated superfast 
algorithm for solving structured linear systems of equations, is prone to numerical stability problems 
(cf. [B85], [P01, Chapter 5]). In Section 3, however, we readily extend our results for GENP to 
BGE. 6 In particular a proper random structured pre-processing is likely to make BGE safe and nu¬ 
merically safe. This enables us to resurrect the MBA algorithm for numerical computations (see our 
Remark 5.6). In Remark 4.3 we accelerate our pre-processing for BGE by performing it recursively 
for leading block submatrices rather than once for the whole matrix. 

1.5 Organization of the paper 

We organize our presentation as follows. In the next section we cover basic definitions and some 
preliminary results. In Section 3 we define BGE and show that, for any input, safety as well as 
numerical safety of GENP imply the same properties also for BGE. In Section 4 we prove our basic 
theorems about safety and numerical safety of GENP (and consequently BGE) with multiplicative 
pre-processing and at the end cover some techniques for enhancing the power and efficiency of 
pre-processing. In Section 5 we recall the definitions and some basic properties of the matrices of 
Discrete Fourier transform and circulant and factor-circulant (/-circulant) matrices. In Section 6 
we cover some families of efficient multipliers. In Section 7 we prove that randomized circulant 
pre-processing for GENP and BGE is universally safe (the contribution of the second author) but is 
numerically unsafe for some specific input class of matrices. In Section 8 we prove universal safety 
of augmentation and additive pre-processing with Gaussian as well as SRFT pre-processors. Section 
9 (also the contribution of the second author) covers our numerical experiments with benchmark 
inputs and inputs generated according to [H02, Section 18.3]. In Section 10 we briefly recall our 
progress and then comment on its extension to low-rank approximation and other fundamental 
matrix computations. A number of our results for Gaussian pre-processing can be extended to 
pre-processing under the uniform probability distribution on a finite set (cf. Theorem 2.1). 

2 Basic definitions and preliminary results 

Hereafter “likely” and “unlikely” mean “with a probability close to 1” and, respectively, “to 0”. 
We call an m x n matrix Gaussian and denote it G m> „ if all its entries are i.i.d. standard Gaussian 
variables. g mxn , R mxn , and C mxn denote the classes ofmxn Gaussian, real and complex matrices, 
respectively. In order to simplify our presentation, we assume dealing with real matrices, except for 
the matrices of discrete Fourier transform of Section 5.1 and /-circulant matrices of Section 5.2 used 
in Sections 6 and 9, but we can readily extend our study to the complex case. 

2.1 General matrices: basic definitions 

In this subsection we recall some basic definitions for general matrix computations (cf. [GL13]). 

1. I g is a g x g identity matrix. Ok,i is the k x l matrix filled with zeros. 

®By combining our dual approach of Section 4.3 with the study in [YC97] of the growth factor in PLUP’ factorization 
of the input matrix, we can deduce the results similar to our present ones for GENP but not for BGE. 
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2. (B i | B 2 | • • • | -Bfc) is a block vector of length k, and diag(i?i, B 2 , ■ ■ ■ ,B]f) is a k x k block 
diagonal matrix, in both cases with blocks B\, B 2 , ■ ■., B^. 


3. Wk,i denotes the k x l leading (that is, northwestern) block of an m x n matrix W for k < m 
and l < n. A matrix is strongly nonsingular if all its square leading blocks are nonsingular. 

4. H(W), W T and W H denote its range (that is, column span), transpose and Hermitian trans¬ 
pose, respectively. W H = W T for a real matrix W. 

5. An to x n matrix W is called unitary (in the real case also and preferably orthogonal) if 
W H W = I n or WW H = Im • 

6. Q(W) denotes the matrix obtained by means of column orthogonalization of a matrix W. 
followed by the deletion of the columns filled with zeros (cf. [GL13, Theorem 5.2.3]). 

7. Ilkbll and ||fT||i? denote its spectral and Frobenius norms, respectively. 

8. Sw^wTw * s full Singular Value Decomposition (or full SVD) of an to x nmatrix W where 
Sw and Tw are square unitary matrices and Ew = diag(diag(cr J (W / ))^ =1 , O m _ Pj „_ p ), for 
p = rank(W), is the to x n diagonal matrix of the singular values of the matrix W, 

<7i(W)) > <t 2 {W)) > > a P (W)) > 0, <Tj(W)) = 0 for j > p. 

9. W = Sw,p^w, P Tw p com pact SVD, where Ew,p = diag(<7j(W))^ =1 is the p x p leading 
submatrix of Ew and the matrices Sw,p and T\y p are formed by the first p columns of the 
matrices S\y and TV, respectively. 

10. W+ = T w , p E^ S\v,p i s Moore-Penrose pseudo inverse. 

(W+)+ = W, WW + = I m if rank (W) = to, W + W = I n if rank(tV) = n, and W + = W~ x for 
a nonsingular matrix W. 

11. ||W|| = ai(W) and ||W||f = (J2j=i tfiW)) 1 / 2 = (Trace ( W H W )) 1 ^ 2 denote its spectral and 
Frobenius norms, respectively. 

||VIV|| < ||V|| ||W|| and ||V1V||f < ||V||F||kF||F, for any matrix product VW. 

Ill'll = ll^ + ll = 1) ||t^W|| = ||W|| and ||W[7|| = ||W|| if the matrix U is unitary. 

12. (jp{W) = \/\\W + \\. k{W) = ||W|| 11 W + \\ = a\(W)/a p (W) > 1 is the condition number of W. 

13. The £-rank of a matrix, for a positive £, is the minimum rank of its approximations within the 
norm bound £. The numerical rank of a matrix is its £-rank for £ small in context. 

14. A matrix W is ill-conditioned if its condition number is large in context or equivalently if its 
rank exceeds its numerical rank. The matrix is well-conditioned if its condition number is 
reasonably bounded. The ratio of the output and input error norms of Gaussian elimination 
is roughly the condition number of an input matrix (cf. [GL13]). 

2.2 Random matrices: definitions, some basic properties 

We use the acronym “ i.i.d .” for “independent identically distributed”, keep referring to standard 
Gaussian random variables just as Gaussian, and call random variables uniform over a fixed finite 
set if their values are sampled from this set under the uniform probability distribution on it. 

The matrix is Gaussian if all its entries are i.i.d. Gaussian variables. 

Theorem 2.1. Suppose that A is a nonsingular n x n matrix and H is an n x n matrix whose 
entries are linear combinations of finitely many i.i.d. random variables and let det ((AH)pi) vanish 
identically in them for neither of the integers l, l < n. (i) If the variables are uniform over a set 
S of cardinality |<S|, then the matrix (AH)pi is singular with a probability at most l/\S\, for any l, 
and so the matrix AH is strongly nonsingular with a probability at least 1 — 0.5(n — l)n/|<S|. (ii) If 
these i.i.d. variables are Gaussian, then the matrix AH is strongly nonsingular with probability 1. 
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Proof. Part (i) of the theorem follows from a celebrated lemma of [DL78]. Derivation is specified, 
e.g., in [PW08]. Claim (ii) follows because the equation det^Aff)^) for any integer l in the range 
from 1 to n defines an algebraic variety of a lower dimension in the linear space of the input variables 
(cf. [BV88, Proposition 1]). □ 

Lemma 2.1. (Orthogonal invariance of a Gaussian matrix. ) Suppose that k, m, and n are three 
positive integers, k < max{m, n}, G is an mxn Gaussian matrix, and S and T are kxm and nx k 
orthogonal (or unitary) matrices, respectively. Then SG and GT are Gaussian matrices. 

2.3 Norm of a Gaussian matrix and of its pseudo inverse 

Next we recall some estimates for the norms of a Gaussian matrix and of its pseudo inverse. For 
simplicity we assume that we deal with real matrices, but similar estimates in the case of complex 
matrices can be found in [CD05] and [ES05]. Hereafter we write v m ,n = ||G|| and n = ||G + || for 
a Gaussian m x n matrix G, and write E(i>) for the expected value of a random variable v. 

Theorem 2.2. (Cf. [DS01, Theorem II. 7].) Suppose that m and n are positive integers, h = 
max{m, n}, t > 0, and z > 2y/h- Then (i) Probability ]V m>n > t + yfrrx + y/n} < exp(—t 2 /2) and 
(ii) E( 1^771,71 ) < y/rn + y/n. 

Theorem 2.3. Let P(:r) = / 0 °° exp (—t)t x ~ 1 dt denote the Gamma function and let x > 0. Then 

(i) Probability {v+ n > m/x 2 } < r fZ-n+ 2 ) f or m>n> 2, 

(ii) Probability {iz+ n > x} < 2.35 y/n/x for n > 2, 

(Hi) E {(vp m n ) 2 ) = m/\m — n — 1|, provided that m > n + 1 > 2, and 
(iv) E(i/+ ) < ey/rn/\m — n\, for e = 2.7182818 ..., provided that m ^ n. 

Proof. See [CD05, Proof of Lemma 4.1] for claim (i), [SST06, Theorem 3.3] for claim (ii), and 
[HMT11, Proposition 10.2] for claims (iii) and (iv). □ 

Probabilistic upper bounds on n of Theorem 2.3 are reasonable already for square matrices, 
for which m = n, but become much stronger as the difference |m — n\ grows above 2. 

Theorems 2.2 and 2.3 combined imply that an m x n Gaussian matrix is very well-conditioned 
if the integer m — n is large or even moderately large, and still can be considered well-conditioned 
if the integer |m — n\ is small or even vanishes (possibly with some grain of salt in the latter case). 
These properties are immediately extended to all submatrices because they are also Gaussian. 


3 Recursive Factorization of a Matrix. BGE and GENP 


In this section we specify recursive factorization of a strongly nonsingular matrix, which we will use 
as a basis for our simultaneous study of safety and numerical safety of GENP and BGE. 

For a nonsingular 2x2 block matrix A = of size n x n with nonsingular k x k pivot 

block B = Ak t ki define S = S(Ak t k,A) = E — DB~ 1 C , the Schur complement of Ak,k hr A, and the 
block factorizations, 


( Ik O kt r\ ( B OkA ( Ik B~ l c\ 
{DB- 1 I r ) \O r , k S ) \Ok,r Ir )' 

( I k ~B~ 1 C\ (B- 1 OkA ( h OkA 
\Ok,r Ir ) \O t ,k S-ijy-DB - 1 I r )■ 


(3.1) 

(3.2) 


These factorizations represent Gauss-Jordan elimination applied to a 2 x 2 block matrix. 

We readily verify that S' -1 is the (n — k) x (n — k) trailing (that is, southeastern) block of the 
inverse matrix A -1 , and so the Schur complement S is nonsingular since the matrix A is nonsingular. 

Factorization (3.2) reduces the inversion of the matrix A to the inversion of the leading block 
B and its Schur complement S, and we can recursively reduce the inversion task to the case of 
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the leading blocks and Schur complements of decreasing sizes as long as the leading blocks are 
nonsingular. After sufficiently many recursive steps of this process of BGE, we only need to invert 
matrices of small sizes, and then we can stop the process and apply a selected black box inversion 
algorithm, e.g., based on orthogonalization. If we limit the number of recursive steps, we arrive at 
BGE dealing with large blocks and can use the benefits of block matrix algorithms, but if we keep 
recursive partitioning, then BGE eventually turn into GENP. 

Namely, in |4og 2 (n)] recursive steps all pivot blocks and all other matrices involved into the 
resulting factorization turn into scalars, all matrix multiplications and inversions turn into scalar 
multiplications and divisions, and we arrive at a complete recursive factorization of the matrix A. If 
k = 1 at all recursive steps, then the complete recursive factorization (3.2) defines GENP. 

Moreover, any complete recursive factorizations turns into GENP up to the order in which we 
consider its steps. This follows because at most n— 1 distinct Schur complements S = S(Ak,k, A), for 
k = 1,..., n — 1, are involved in all recursive block factorization processes for n x n matrices A 1 and 
so we arrive at the same Schur complement in a fixed position via GENP and via any other recursive 
block factorization (3.1). Hence we can interpret factorization step (3.1) as the block elimination of 
the first k columns of the matrix A , which produces the matrix S = S(Ak,k, A). If the dimensions 
d±,... ,d r and d\ ,..., df of the pivot blocks in two block elimination processes sum to the same 
integer k, that is, if k = d\ + ■ ■ ■ + d r = d\ + • • • + dr, then both processes produce the same Schur 
complement S = S(Ak,k, A). The following results extend this observation. 

Theorem 3.1. In the recursive block factorization process based on (3.1), the diagonal block and 
its Schur complement in every block diagonal factor is either a leading block of the input matrix A 
or the Schur complement S(Ah,h, Ak,k) for some integers h and k such that 0 < h < k < n and 

S(A h , h , A k ,k) = (S(A h ' h ,A)) h}h . 

Corollary 3.1. The complete recursive block factorization process based on equation (3.1) can be 
computed by involving no singular pivot blocks (and, in particular, no pivot elements vanish) if and 
only if the input matrix A is strongly nonsingular. 

Proof. Combine Theorem 3.1 with the equation det A = (det B) det S, implied by (3.1). □ 


4 Multiplicative Pre-processing for GENP and BGE 

4.1 Definition and criteria of safety and numerical safety 

In this section, A denotes a nonsingular n x n matrix. 

Suppose that the vector y = Ab satisfies pre-processed linear systems AH y = b and FAHy = 
F b. Then the vector x = H y for y = A _1 b satisfies both linear systems Ax = b and FAx = F b. 
We are going to study such pre-processing A —> AH for GENP and BGE with random and fixed 
post-multipliers H. Our analysis is immediately extended to the pre-processing maps A —> FA, 
A FAH, and A FAF T . 

We call GENP and BGE safe if they proceed to the end with no divisions by 0. 

Corollary 3.1 implies the following result for computations in any field (cf. Remark 4.1). 

Theorem 4.1. GENP is safe if and only if the input matrix is strongly nonsingular. 

Next assume that GENP and BGE are performed numerically, with rounding to a fixed preci¬ 
sion, e.g., the IEEE standard double precision. Then extend the concept of safe GENP and BGE to 
numerically safe GENP and BGE by requiring that the input matrix be strongly nonsingular and 
strongly well-conditioned, that is, that the matrix itself and all its square leading blocks be nonsin¬ 
gular and well-conditioned. Any inversion algorithm for a nonsingular matrix is highly sensitive to 
the input and rounding errors if the matrix is ill-conditioned [GL13]. GENP explicitly or implicitly 
involves the inverses of all its square leading blocks, and we arrive at the following Criterion of 
Numerical Safety of GENP implied by [PQZ13, Theorem 5.1]: 

GENP applied to a strongly nonsingular matrix is highly sensitive to the input and rounding 
errors if and only if some of the square leading blocks are ill-conditioned. 

Let us restate this criterion in the form similar to Theorem 4.1. 
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Theorem 4.2. GENP is safe and numerically safe if and only if the input matrix is strongly 
nonsingular and strongly well-conditioned. 

Remark 4.1. BGE is safe if so does GENP. Likewise BGE is safe numerically if so does GENP. 
Thus our proofs of safety and numerical safety of GENP apply to BGE. The converse is not true: 
GENP fails (resp. fails numerically) if any square leading block of the input matrix is singidar (resp. 
ill-conditioned), but BGE may by-pass this block and be safe (resp. numerically safe). 


4.2 GENP with Gaussian pre-processing 

Next we prove that GENP with Gaussian pre-processing is safe with probability 1 for any nonsingular 
input matrix and is likely to be numerically safe if this matrix is also well-conditioned. 


Theorem 4.3. Assume that we are given a nonsingular and well-conditioned n x n matrix A and 
a pair of nxn Gaussian matrices F and H and let u k k , v kn , and i/} k denote random variables of 
Section 2.3. Then 

(i) the matrices FA, AH, and FAH are strongly nonsingular with a probability 1, 

(ii) \\((AH)k,k) + \\ < u+ k \\At >n \\ < < fc ||A+||, ||((F4)fc,fc) + || < < fc ||< fc || < u+ k \\A+\\, and 
(in) ||((FAff)fc,fc) + || < ^„< fc min{||A+ ||,||A+ ||} < vt n vt k \\A+\\. 


Proof. Claim (i) follows from claim (ii) of Theorem 2.1. 

Hereafter a pair of subscripts p, q shows the matrix size p x q. The proof of claim (ii) is similar 
for both products AH and FA- we only cover the case of the former one. 

Notice that (AH) k , k = A k , n Hn,k and substitute compact SVD A kn = Sk, k ^k,kT)( k where T, k ,k 
is a diagonal matrix and S k ,k and T nik are orthogonal matrices. Obtain 


(AH)k,k — S k .kTik,hT nk hin,k — S k ,k^k,kG k .k 


where G k , k = TZ) k H n ,k is a k x k Gaussian matrix by virtue of Lemma 2.1. Deduce that 


((AH) k<k ) + = G+^Slk, and so \\((AH) k , k ) + \\ = ||G+ fe £^|| < ||G+* 


J k,k I 


Substitute the equations ||G^ fc || = v kk and ||£)“ fe || = ||4^" n || < ||A + || and obtain claim (ii). 

Let us prove claim (iii). Notice that ( F AH) k , k = F k , n AH n ^, substitute compact SVD 
A k ,n = S k ,k^k,kT„ tk , and obtain 


(F AH) k ,k = F k , n S k ,kZk,kTZ k Hn,k = G'£ fc>fc G" 

where G' = G k , n = F k , n Sk,k and G" = G n)k = TZ k Hn,k are Gaussian matrices by Lemma 2.1. 
Substitute compact SYDs G' = Sc ,k,k^G' ,k,kTc k n and G" = Sg", n ,k^G" ,k,kTc> k n and obtain 


(FAH) k ,k = Sc,k,kBTQ„ k ,n f° r B = T,c,k,kWT,c',k, k and W = Tq, k n Y, k , k SG",n,k- 


Observe that the latter equation is compact SVD, and so ||!V + || = ||£j!" fc || = ||A)jr n || < ||A + ||. 
Furthermore, clearly \\((FAH) k , k ) + \\ = ||F + || because Sc,k,k and Tc', k ,n are square orthogonal 
matrices. Now observe that 




l|£ + ll<P£,, fc , fc || \\W+\\ 

because T,c,k,k and £(j",fc,fc are square diagonal matrices. Notice that 




= HG ,+ ||=<„and||£ 


"' + I 

■ J G",k,k\ 


= ||G" + || = 


Combine the above equations for the norms ( FAH ) k , k , ||W /+ ||, 

\+\ 




,k- 


and 


|£+, 


,k,k I 


and the 


above bound on the norm ||B + || = \\((FAH) k ,k) + \\ and obtain ||((FAff) fcjfc ) + || < t' k , n l/ Z,k\\^k,n II- 
Apply this estimate to the norm \\{(FAH) k k ) + \\ = \\{{FAH) k k ) + \\ and complete the proof of claim 
(iii) by deducing that \\{{FAH) ktk )+\\ < V+V+WA+ ||. □ 
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Remark 4.2. [PQY15, Corollary 5.2] provides a correct, although very long proof of claim (ii) of 
Theorem f.3 in the case of post-multiplication by H, but states the result with an error, by writing 
v n ,k instead of the correct i ' k ,k- We fix the statement and include a short proof for the sake of 
completeness of our presentation. Claim (in) of the theorem is new and implies some important 
benefits of using two-sided rather than one-sided Gaussian pre-processing (see the next subsection). 

Theorems 2.2, 2.3, and 4.1-4.3 together imply the following result. 

Corollary 4.1. GENP is safe with probability 1 and is likely to be numerically safe if it is applied 
to the matrices FA, AH and FAH where A is a nonsingular and well-conditioned nxn matrix and 
F and H are Gaussian nxn matrices. 

Next we observe some additional benefits of GENP and BGE. 

Remark 4.3. Comparison of the estimates of claims (ii) and (Hi) of Theorem 4-3 shows that shifting 
from one-sided to two-sided Gaussian pre-processing of GENP is likely to enhance its power. Indeed 
suppose that the integer n — k is not small, say, exceeds 3. Then Theorem 2.3 implies that the 
norm u kk is likely to exceed substantially the product v/ n v/ k , that is, GENP is substantially safer 
numerically with two-sided Gaussian pre-processing until the size of the GENP input has been reduced 
to 4x4. At this point, however, it is inexpensive to complete Gaussian elimination by means of a 
reliable method such as GEPP, GECP, or orthogonalization. 

Remark 4.4. BGE can use additional benefits of block matrix algorithms and, rather surprisingly, 
of saving random variables and flops. E.g., first pre-process the k x k leading block of the input 
matrix for a proper integer k < n by using n x k Gaussian multipliers. Having factored this block, 
decrease the input size from n to n — k and then re-apply Gaussian pre-processing. Already by using 
such a two-step block pre-processing for k = n/2, we save 1/4 of all random variables and 3/8 of 
arithmetic operations involved. 


4.3 GENP with any nonsingular and well-conditioned pre-processing is 
safe and numerically safe on the average input 


In the previous subsection we assumed that an input matrix A is fixed, and the multipliers F and 
H are Gaussian. Next we prove a dual theorem where we assume that the multipliers are fixed and 
the input matrix is Gaussian. We obtain probabilistic estimates for the matrices ( AH ) kik , ( FA) k ,k , 
and ( FAH) k ,k similar to those of Theorem 4.3, and we will readily extend them to the estimates 
for pre-processed GENP applied to the average input matrix (see Corollary 4.2). 


Theorem 4.4. Assume that we are given a Gaussian nxn matrix A and a pair of nxn nonsingular 
and well-conditioned matrices F and H and let v kk denote a random variable of Section 2.3. Then 
(i) the matrices FA, AH, and FAH are strongly nonsingular with a probability 1, 


(ii)MAH) ktk )+\\<v+ k 
\+ 


I HI 


ll((E A)k,k)~ ||/ 2s •'k.k 


< V 

(Hi) \\((FAH)k,k) + \\ < WiKJW.kWHtkW < \\F + \k 


II Fk 


+ 

k,n 11 ; 

tk\\H + \\- 


and 


Proof. The proof is similar to that of Theorem 4.3, and we only specify the proof of claim (iii). 
Recall that ( FAH) ktk = F k ^ n AH n}k , substitute compact SVDs F k , n = SF,k,k^F,k,kT/ n k and 
H n , k = Sji,n,k^- i H,k,k r TH,k,k' > and obtain (FAH) kk = S F ^F,k,kG k ,k^H,k,kT H ^ k . Here G k ^ k — 
T ] / k n ASn,n,k is a k x k Gaussian matrix by virtue of of Lemma 2.1. The matrices S F ,k,k , Spii, 
E H,k,k > T F ,k,k are nonsingular by assumption, and so is the matrix G k<k with probability 1 by virtue of 
claim (ii) of Theorem 2.1. Hence with probability 1, ((FAH) ktk ) + = T H ^ k ^ k Y// k k G/^ k E// k k S r / k k , 


and so \\((F AH) k , k ) + \\ < \\Tn, k ,k\ 


\\T H ,k,k\\ = 11 S' 
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F,k,k I 


= 1, IIE 


F,k,k\ 
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k,k\ 


HE 

t-i 
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■»-i I 


and ll G fc,fcll = Pk.k • 
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Theorems 4.2, 4.3, 2.2, and 2.3 together imply the following result. 


Corollary 4.2. GENP is safe and numerically safe when it is applied to the matrices FA, AH and 
FAH where A is the average nxn matrix defined under the Gaussian probability distribution and 
F and H are n x n nonsingular and well-conditioned matrices. 



4.4 Heuristic amelioration of pre-processing for GENP 

Suppose that A, F, and H are a nonsingular and well-conditioned matrix A has been pre-processed 
with multipliers Ft and/or II, recursively sampled from two sufficiently large sets F = {£/}i and F = 
{Hi\i of nx n matrices generated independently of each other and of the matrix A. Then Corollary 
4.2 implies that application of GENP with pre-processing by a nonsingular and well-conditioned 
multipliers Fj and/or it,; is safe and numerically safe for most of nonsingular and well-conditioned 
input matrices A. This somewhat informal claim is in good accordance with our empirical study, 
although for any multipliers F we can readily exhibit bad nonsingular and well-conditioned inputs 
A for which GENP applied to the matrix FA fails numerically, and similarly for GENP applied to 
matrices AH and FAH. 

Next we comment on two heuristic recipes for choosing multipliers. 

(i) Clearly the product of two sparse matrices has good chances to have singular square leading 
blocks, and so one can be cautious about pre-processing a sparse matrix with sparse multipliers. For 
a partial remedy, we can more evenly distribute nonzero entries throughout a sparse multiplier, but 
for a more reliable remedy, we can apply dense structured multipliers. 

(ii) Here is a general useful heuristic recipe for simplifying repeated pre-processing when GENP 
has failed numerically for two matrices AHi and AH 2 : apply GENP to the sum AHi + AH 2 or 
product AH\H 2 . In our tests in Section 9, this recipe has consistently worked, but there are also 
other attractive options, e.g., using linear combinations or polynomials in H\ and H 2 as multipliers. 


5 Two Classes of Basic Structured Matrices for Generation 
of Efficient Multipliers 

5.1 Matrices of discrete Fourier transform 

Definition 5.1. Write to = exp(2^EI) ; H = (w y )"// 0 , -^11 is unitary, H _1 = , 

u> denotes a primitive n-th root of unity, O and f2 _1 denote the matrices of the discrete Fourier 
transform at n points and its inverse, to which we refer as DFT(n ) and IDFT(n), respectively. 

Remark 5.1. If n = 2 k is a power of 2, we can apply the FFT algorithm and perform DFT(n) 
and IDFT(n ) by using only 1.5nlog 2 (?i) and 1.5nlog 2 (n) + n arithmetic operations, respectively. 
For an nx n input and any n, we can perform DFT(n) and IDFT(n) by using cn log(?r) arithmetic 
operations, but for a larger constant c (see [P01, Section 2.3]). 


5.2 Circulant and /-circulant matrices 


For a positive integer n and a complex scalar /, define the n x n unit /-circulant matrix Zf = 
^ 0 ^ and the n x n general f-circulant matrix Zf(v) = Yl'i-o v iZf, 


/o 


Zf = 


1 

0 

J 1 

0 

and Zf(v) = 

/ v 0 fv n - 1 

Vi Vo 

fvi \ 

Vo ... 

0 

0 


V^n-l 

i O 

1 

0 ) 



Z = Zq is the unit down-shift matrix. Zq(v) is a lower triangular Toeplitz matrix, Zi(v) is a 
circulant matrix. Z™ = fl n , and the matrix Zf(v) is defined by its first column v = (u,)"^ 1 . 

We call an /-circulant matrix a Gaussian f -circulant (or just Gaussian circulant if / = 1) if its 
first column is filled with independent Gaussian variables. For every fixed /, the /-circulant matrices 
form an algebra in the linear space of n x n Toeplitz matrices 


T — ( ti-j ) 


n—1 

i,j= O' 


(5.1) 
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Hereafter, for a vector u = (uj)" -_ 0 , write Z?(u) = diag(wo, • ■ •, u n -i), that is, D( u) is the 
diagonal matrix with the diagonal entries uq, ..., u n —i. 

Theorem 5.1. (Cf. [P01, Theorem 2.6.4].) Iff 7^0, then f n -circulant matrix Zfn(y) ofsizenxn 
can be factored as follows, Zfn(v) = Uj 1 D(Ufv)U) for Uf = LID( f), f = and f / 0. In 

particular, for circulant matrices, D( f) = I, Uf = fl, and Z\(v) = 

Remark 5.2. We cannot extend this theorem directly to a k x k Q-circulant (that is, triangular 
Toeplitz) matrix T, but we can embed this matrix (as well as any k x k Toeplitz matrix T) into an 
nx n circulant matrix C for n > 2k — 1 and then can readily extract the product Tv (for any vector 
v) from the product C w for a proper vector w having v as a subvector. 

Remark 5.3. For an nx n Toeplitz or Toeplitz-like matrix A and annxn circulant matrix H, one 
can compute a standard displacement representation of the product AH by applying just 0{n log(n)) 
flops (see definitions and derivation in [P01]). 

Corollary 5.1. (i) Suppose that we are given a diagonal matrix D( u) = diag(«o, • • •, it n -i) for 
u = fiv. Then we can recover the vector v = TQ h u, defining the circulant matrix Z\(v). 

(ii) If the vector v is Gaussian, then so is also the vector u = (uj)" =1 = (by virtue of 

Lemma 2.1) and vice versa. Each of the two vectors defines a Gaussian circulant matrix Z\ (v). 

(Hi) By choosing Ui = exp(^v / —1) and real Gaussian variable <j>i for all i, we arrive at a random 
real orthogonal or unitary nx n circulant matrix Zfav) defined by n real Gaussian parameters fa, 
i = 0,..., n — 1. Alternatively we can set fa = ±1 for all i and choose the signs ± at random, with 
i.i.d. probability 1/2 for all signs. 

(iv) By adding another Gaussian parameter <j>, we can define a random real orthogonal or unitary 
f -circulant matrix Zf(v) for f = exp(-^y/—l). 

The following results imply that a Gaussian circulant matrix is likely to be well-conditioned. 

Theorem 5.2. (Cf. [PSZ15].) Suppose that Z i(v) = fl H Dfl is a nonsingular circulant nxn matrix, 
and let D( g) = diag(gi)? = i, f or S = Then ||^i(v)|| = max/ =1 \gf\, ||^i(v) _1 || = min™ =1 \gfl, 

and k(Zi(v)) = max£ j=1 \g l /g j \, for v = fl -1 g. 

Remark 5.4. Suppose that a circulant matrix ^i(v) has been defined by its first column vector v 
filled with integers ±1 for a random choice of the i.i.d. signs ±, each + and — chosen with probability 
1 / 2 . Then, clearly, the entries gt of the vector g = flv = ( c / i )” =1 satisfy \gi\ < n for all i an n, and 
furthermore, with a probability close to 1, max™ =1 log(1 /1|) = 0(log(n)) as n —> oo. 

Remark 5.5. In the case of a Gaussian circulant matrix Z\(y), all the entries gt are i.i.d. Gaussian 
variables, and the condition number k(Z i(v)) = max^ - =1 \gi/9j\ is not likely to be large. 

Remark 5.6. The superfast but numerically unstable MBA algorithm (cf. [B85], [P01, Chapter 5]) 
is precisely the recursive BGE, accelerated by means of exploiting the displacement structure of the 
input matrix throughout the recursive process of BGE. The algorithm is Fortunately, pre-processing 
with appropriate random structured multipliers is likely to fix these problems keeping the algorithm 
superfast, that is, using almost linear number of flops (cf. [P01, Sections 5.6 and 5.7]). 

6 Generation of Efficient Multipliers 

6.1 What kind of multipliers do we seek? 

Trying to support safe and numerically safe GENP we seek multipliers with the following properties: 

1. Multipliers F and H must be nonsingular and well-conditioned. 

2. The cost of the computation of the product FA, AH, FAH or FAF H should be small. 

3. Random multipliers should be generated by using fewer random variables. 
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4. For structured input matrices, the multipliers should have consistent structure. 

5. The multipliers should enable GENP to produce accurate output with a probability close to 1 
even with using no iterative refinement, at least for the inputs of reasonably small sizes. 

These rules give general guidance but are a subject to trade-off: e.g., by filling multipliers with 
values 0, 1, and —1, we save flops when we compute their products with the input matrix A, but 
we increase the chances for success of our pre-processing if instead we fill the multipliers with real 
or complex random variables. 

Some families of our new nonsingular multipliers for GENP and BGE refine (towards the above 
properties) the customary families of sparse and structured rectangular multipliers used for low-rank 
approximation (cf. [HMT11], [Mil], [W14]). E.g., we sparsify the SRHT and SRFT multipliers, 
which substantially simplifies their generation and application as multipliers. 

Remark 6.1. Property 5 was satisfied in our tests for most of our multipliers (unlike the multi¬ 
pliers of [BDHT13] and [DDFlf]). This property is important for n x n input matrices of smaller 
sizes: refinement iteration involves 0(n 2 ) flops versus cubic cost of |n 3 flops, involved in Gaussian 
elimination, but for small n quadratic cost of refinement can make up a large share of the overall 
cost. A single refinement iteration was always sufficient (and more frequently was not even needed) 
in our tests in order to match or to exceed the output accuracy of GEPP (see also similar empirical 
data in [PQZ13], [BDHT13], [DDFIf], and [PQY15] and see [H02, Chapter 12], [GL13, Section 
3.5.3], and the references therein for detailed coverage of iterative refinement). 

6.2 Some basic matrices for the generation of multipliers 

We are going to use the following matrix classes in the next subsections as our building blocks for 
defining efficient multipliers. 

1. P denotes an n x n permutation matrix. 

2. D denotes a unitary or real orthogonal diagonal matrix diag(dj)"_T 0 , with fixed or random 
diagonal entries di such that |dj| = 1 for all i, and so each of n entries di lies on the unit circle 
{x : \z | = 1}, being either nonreal or ±1. 

3. Define a Givens rotation matrix G(i,j,9) (cf. [GL13]): for two integers i and j, 1 < i < j < n, 
and a fixed or random real 9, 0 < 6 < 2i r, replace the submatrix I 2 in the ?’th and jth rows and 
columns of the identity matrix /„ by the matrix ( fl s *) where c = cos 9, s = sin0, c 2 + s 2 = 1. 

4. Define a Householder reflection matrix I n — by a fixed or random vector h (cf. [GL13]). 

We also use a DFT matrix f l n and transforms DFT, DFT(n) and IDFT(n) of Section 5.1 as 
well as circulant and /-circulant matrices of Section 5.2. Besides unitary and orthogonal diagonal 
matrices D , defined above, we use a nonsingular and well-conditioned diagonal matrix diag(±2 hi ) 
in Family 3 of Section 9, for random integers bi uniformly chosen from 0 to 3. 

6.3 Four families of dense structured multipliers 

FAMILY 1. The inverses of bidiagonal matrices: 

H = (I n + DZ)- 1 or H = (/„ + Z t D)~ 1 

for a diagonal matrix D and the down-shift matrix Z of Section 5.2. 

We can randomize a matrix H by choosing n— 1 random diagonal entries of the matrix D (whose 
leading entry makes no impact on H) and can pre-multiply it by a vector by using 2 n — 1 flops. 

||t/j| < s/n because nonzero entries of the triangular matrix H = ( I n + DZ)~ X have absolute 
values 1, and clearly ||tt~ 1 || = \\I n + DZ\\ < \/2. Hence k(H) = ||R|| ||R _1 || (the spectral condition 
number of H) cannot exceed \/2n for H = (/„ + DZ)~ X , and likewise for H = ( I n + Z T D)" 1 . 
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FAMILY 2. Chains of scaled and permuted Givens rotations with a DFT factor. A permutation 
matrix P and a sequence of angles 9\, ... 1 9 n -\ together define a permuted chain of Givens rotations 

n —1 

G{0 w ..,6 n - 1 )=Pl[G(i,i + l, 6 i ) for n = 2 k . 

i=l 

Combine two such chains G i and G 2 with scaling and DFT into the following dense unitary matrix, 

H = —=DiGiD 2 G 2 D 3 Q n 

y/n 

(cf. [HMT11, Section 4.6]), for diagonal matrices Di , D 2 and D 3 . This matrix can be multiplied by 
a vector by using about 10?r flops plus the cost of performing DFT. By randomizing diagonal and 
Givens rotation factors of H we can involve up to 7n — 2 random variables. 

FAMILY 3. Pairs of scaled and permuted Householder reflections with a DFT factor. Define 
the unitary matrix 

H = —j=DiRiD 2 R 2 D^,Fl n 

\Jn 

where D 1 , D 2 , and D 3 are diagonal matrices, Ri and R 2 are Householder reflections, and is a 
DFT matrix of Section 5.1. This matrix can be multiplied by a vector by using about In flops plus 
the cost of performing DFT. By randomizing diagonal and Householder reflection factors of H we 
can involve up to 6n random variables. 

FAMILY 4. f-circulant matrices of Section 5.2. H = Zf(v) = Y^i=o v i^fy f° r the matrix Zf of 
/-circular shift, defined by a scalar / ^ 0 and its first column vector v = . By randomizing 

this vector we can involve up to n random variables, and then Theorem 5.2 and [PSZ15] enable us 
to bound the condition number n(H). By virtue of Theorem 5.1, Zf(v ) = ( FDf)~ l DFDf where 
D is a diagonal matrix, Df = diag(/*)”Z 0 1 , and Q n is the DFT matrix of Section 5.1. Based on this 
expression, we can multiply the matrix H by applying two DFT(n), an IDFT(rc), and additionally 
n + 26fn multiplications and divisions where <5/ = 0if/ = l and <5/ = 1 otherwise. 

6.4 Sparse ARSPH matrices based on Hadamard’s processes 

FAMILY 5. A 2 fc x 2 k Abridged Recursive Scaled and Permuted Hadamard (ARSPH) matrix 
H = H 2 k d of depth d has q = 2 d nonzero entries in every row and in every column, for a fixed 
integer d, 1 < d < k. Hence such a matrix is sparse unless k — d is a small integer. 

A special recursive structure of such a matrix allows highly efficient parallel implementation of 
its pre-multiplication by a vector (cf. Remark 6.2). 

We recursively define matrices H 2 h+i d for h = k — d,... ,k — 1, as follows: 


H 2 h+i d 


D 2 h,+ 1 P 2 h+1 


( H 2 h' d 
\H 2 h d 


Fh 2 h d \ 

~H 2 h^d) 


P^h + l D 2 ^ + 1 7 


H 2 k-d d 


(I 2 k-d 
\I 2 k-d. 


I 2 k — d 
I 2 k — d 


( 6 . 1 ) 


Here P 2 h+i and P 2 h +1 are 2 h+1 x 2 h+1 permutation matrices, D 2 h +1 and D 2 h+i are 2 h+1 x 2 h+1 
diagonal matrices. We can pre-multiply a matrix H 2 k d by a vector by using at most 3 dn flops. 

If the matrices D 2 k +1 or D 2 h+i are real, having nonzero entries ±1, then these flops are additions 
and subtractions, and matrix H 2 k d is orthogonal up to scaling by a constant; otherwise it is unitary. 

For random permutation matrices Pi and Pi and the diagonal matrices Di and Di 1 the matrix 
B = H 2 k d depends on up to (1 + 1/2 + • • • + l/2 d )2 fc+2 = (1 — l/2 d )2 fc+3 random variables. 

For d = k, the matrix H 2 k d of (6.1) turns into a dense (unabridged) RSPH matrix. 

By letting D 2 h 1 D 2 h I 2 h , P 2 h = P 2 h == I 2 h , or D 2 h = D 2 h = P 2 h == P 2 h = I 2 h for all h , 
we arrive at the three sub-families ARPH, ARSH, and AH of the family of ARSPH matrices. For 
d = k, an AH matrix turns into the dense (unabridged) matrix of Walsh-Hadamard transform. 
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Special sub-families of 2 k x 2 k Abridged Scaled and Permuted Fourier (ASPF) and Abridged 
Scaled and Permuted Hadamard (ASPH) matrices use the same initialization of (6.1), 


^ 2 k ~ d ,d ~ H 2 k-d d — 


l^k — d l^k — d 

l^k—d l^k—d 


and are defined by the following recursive processes, which specialize (6.1), 


h-\-i d — JF^h+i 







H 2 h+l d 


( H 2 h, d 

\H 2 h ,d 


H 2 h ,cL \ 
-H 2 K td ) ’ 


( 6 . 2 ) 


for h — k — d,.k — 1 (cf. [P01, Section 2.3] and [Mil, Section 3.1]), and output the matrices 
PPl 2 k d D or PH 2 k d D for fixed or random matrices P and D of primitive types 1 and 2, respectively. 

Here D 2 h = diag(w 2 h)i = o 1 an d F 2 h is the x 2 h odd/even permutation matrix, such that P 2 h (u) = 

v, u = (ui)? = o :1 , v = (ui)? = o : \ vj = u 2 j, v j+ 2 h-i = u 2j + i, and j = 0,1, ... , 2 h ~ x - 1. 

The sub-families of ASPF and ASPH matrices in turn have sub-families of ASF, APF, ASH, and 
APH matrices. The n x n AF matrix for d = k is just the matrix DFT(n), fl n of Definition 5.1. 

Recursive process (6.2) defining the matrix Q n is known as the decimation in frequency (DIF) 
radix-2 representation of FFT; transposition turns it into the decimation in time (DIT) radix-2 
representation of FFT. The numbers of random variables involved into generation of general ARSPF 
and ARSPH matrices decrease to at most (1 — l/2 d )2 fc+1 for ASPF and ASPH matrices, further 
decrease to at most (1 — l/2 d )2 fc_1 for ASF, APF, ASH, and APH matrices, and the AF and AH 
matrices involve no random variables. The estimated arithmetic cost of pre-multiplication of all 
these submatrices by a vector is the same as in the case of ARSPF and ARSPH matrices. 

Another well-known special case is given by recursive two-sided Partial Random Butterfly Trans¬ 
forms (PRBTs), based on two unpublished Technical Reports of 1995 by Parker and by Parker and 
Pierce, but improved, carefully implemented, and then extensively tested in [BDHT13]. 

For an n x n input matrix and even n = 2 q, that paper defines PRBT as follows, 


B {n) 


~i R 

V 2 \R 



75 0: -\) diasws) 


(6.3) 


where R and S are random diagonal nonsingular matrices. The paper [BDHT13] defines multipliers 
F and H recursively by using PRBT blocks. According to [BDHT13], the two-sided recursive 
processes of depth d = 2 with PRBT blocks are “sufficient in most cases”. In such processes 
F = diag(Bj"/ 2 \ B^^^B^, and the multiplier H is defined similarly. In the case of depth-d 
recursion, d > 2, each of the multipliers F and H is defined as the product of d factors made up of 
2 J diagonal blocks of size n/ 2 J x n/2- J , for j = 0,..., d — 1, each block of the same type as above, 
and the two-sided multiplication by a vector involves 6 dn flops for = 2 k . 


6.5 Four additional families of sparse multipliers 

FAMILY 6. Sparse f-circulant matrices H = Zf(y) are defined by a fixed or random scalar /, 
|/| = 1, and the first column v having exactly q nonzero entries, for q <C n. The positions and values 
of non-zeros can be randomized (and then the matrix depends on up to 2 n + 1 random variables). 

Such a matrix can be pre-multiplied by a vector by using at most (2 q — 1 )n flops or, in the real 
case where / = ±1 and ty = ±1 for all i, by using at most qn additions and subtractions. 

FAMILY 7. Abridged /” -circulant n x n matrices H of a recursion depth d for a scalar /, 
n = 2 k , and two integers d and k such that |/| = 1, k > d > 0, and the integer k — d is not small, 

H = ( MD f )~ l DMDf. 

Here D = diag(di)F=i an d Df = diag(/*)™7 0 1 are diagonal matrices, | d, \ = 1 for all i and M is an 
AF or AH matrix of Family 5 of recursion depth d. 
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Such a matrix H is unitary up to scaling by a constant (or orthogonal if the matrices M. D and 
Df are real) and can be pre-multiplied by a vector by using at most 6 dn flops. It involves up to n 
random variables, but would involve 3 n (resp. 2n) such variables if we use an ASPF or ASPH (resp. 
ASF, ASH, ASF, or APH) rather than an AF or AH matrix F. For d = k the AF matrix M turns 
into the DFT matrix Cl n and H turns into the g-circulant matrix (cf. Theorem 5.1) 

Z g (y) = Dj^DftnDf, for g = f n , D f = diag (f i )'&,D = diag (d^ 1 , and (d ^ 1 = H„D /V . 

For / = 1, the above expressions are simplihed: g = 1, Df = M = f l n , and H = Vi ^i * s a 

circulant matrix Z\(y) = l n , D = diag(di)™Z 0 1 , for {di)™F 0 = H„v. 

FAMILY 8. Scaled and permuted chains of Givens rotations with an AF or AH factor. A 
permutation matrix P and n — 1 angles 9\,... ,0 n -i together define a permuted chain of Givens 
rotations, 

n—1 

G(0 1 ,...,9 n - 1 ) = P]jG(i,i + l,0 i ) for n = 2 k . 

i—1 

One can combine two such chains G\ and G 2 with scaling and DFT into the dense unitary matrix 


H = D 1 G 1 D 2 G 2 D 3 n n 


(cf. [HMT11, Section 4.6]), for diagonal matrices D i, D 2 and D 3 of primitive type 2. 

For n = 2 fc 2 d we can make matrix H sparse by replacing the factor Q n with an n x n AF or 
AH matrix; then we can pre-multiply this matrix by a vector by using (1.5d + 10)?r flops. 

A matrix H involves up to 7 n random variables, but we can increase this bound by 2 n (resp. by 
n ) if we replace the factor with an ASPF or ASPH (resp. by APF, APH, ASF, or ASH) rather 
than AF or AH matrix. 

FAMILY 9. Pairs of scaled and permuted Householder reflections with an AF or AH factor. 
Define the unitary matrix 

H = —=DiRiD 2 R 2 D 3 Q. n 
\Jn 

where D 2 , and D 3 are diagonal matrices, R\ and R 2 are Householder reflections, and is 
a DFT matrix. This matrix can be multiplied by a vector by using about 7 n flops plus the cost 
of performing DFT. By randomizing diagonal, permutation and Householder reflection matrices we 
can involve up to 7 n random variables. For n = 2 k and sparse vectors h; defining Householder 
reflections i?,;, for i = 1,2, we can make matrix H sparse by replacing the factor Q n with an n x n 
AF or AH matrix of recursion depth d such that 2 d <C n; then we can multiply this matrix by a 
vector by using at most (1.5d + 7)n flops. 

6.6 Estimated numbers of random variables and flops 

Table 6.1 summarizes upper bounds on (a) the numbers of random variables involved into the 
matrices B of Families 1-9 and (b) the numbers of flops for multiplication of such a matrix B by a 
vector.' Compare these data with n 2 random variables and (2n — l)n flops involved in the case of 
a Gaussian n x n multiplier and see more refined bounds in Sections 6.3-6.5. 

Remark 6.2. Other observations besides flop estimates can be decisive. E. g., a special recursive 
structure of an ARSPH matrix H 2 k allows highly efficient parallel implementation of its multiplica¬ 
tion by a vector based on Application Specific Integrated Circuits (ASICs) and Field-Programmable 
Gate Arrays (FPGAs), incorporating Butterfly Circuits [DE], 

' The matrices of Families 2-4 involve up to 7 n — 2, 7n, and n random variables, respectively, and are multiplied 
by a vector by using 0(nlog(n)) flops. 
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Table 6.1: The numbers of random variables and flops 


Family 

1 

5 

6 

7 

8 

9 

random variables 

mam 

;g||gg| 

2 q 1 

n 

7n — 2 

7 n 

flops 

2 n- 1 

3 dn 

( 2 q — l)n 

6 dn 

(1.5d + 10)n 

(1.5d + 7)n 


6.7 Other basic families of multipliers 

There is a variety of other interesting basic matrix families. E.g., one can generalize Family 6 to the 
family of sparse matrices with q nonzeros entries ±1 in every row and in every column for a fixed 
integer q, 1 < q -C n. Such matrices can be defined as the sums P t for fixed or random 

permutation matrices Pi and diagonal matrices Di 1 involve up to qn random variables, and can be 
pre-multiplied by a vector at the same estimated cost as sparse /-circulant matrices. 

For another example, one can modify a Givens chains of the form D 1 G 1 D 2 G 2 D 3 F, for F denoting 
an DFT, AH, ASPF, ASPH, APF, APH, ASF, or ASH matrix, by replacing one of the matrices G\ 
or G 2 with a permuted Householder reflection matrix or the inverse of a bidiagonal matrix. 

The reader can find more families of multipliers in our Section 9. 


7 Symbolic and numerical GENP with one-sided randomized 
circulant pre-processing 

7.1 GENP with one-sided randomized circulant pre-processing is likely 
to be safe universally (for any input) 

In symbolic application of GENP one only cares about its safety rather than numerical safety. In 
this case the power of randomization is strengthened. Claims (i) of Theorems 4.3 and 4.4 imply that 
GENP pre-processed with any nonsingular multiplier F or FI (e.g., FI = I n ) or with any pair of such 
multipliers is unsafe only for an algebraic variety of lower dimension in the space of all inputs and 
that for a fixed input GENP is safe if pre-processed with almost any nonsingular multiplier or any 
pair of such multipliers apart from an algebraic variety of lower dimension. Therefore in symbolic 
computations one can quite confidently apply GENP with no pre-processing and in the very unlikely 
case of failure re-apply GENP with a random nonsingular multiplier. 

For a theoretical challenge, however, one can seek randomized multipliers that support safe 
GENP and BGE universally, that is, with probability 1 for any nonsingular input. This challenge 
has been met already in 1991 (see, e.g., [BP94, Section 2.13, entitled “Regularization of a Matrix 
via Preconditioning with Randomization”]). Among the known options, one-sided pre-processing 
with random Toeplitz multipliers of [KP91] is most efficient, but a little inferior two-sided pre¬ 
processing with random triangular Toeplitz multipliers of [KS91] has been more widely advertised 
in the Computer Algebra community and has become more popular there. 

Next we prove that even pre-processing with one-sided Gaussian or random uniform circulant 
multipliers (see Section 5.2 for definitions) is likely to make GENP safe, that is, involving no divisions 
by 0. Using circulant multipliers saves 50% of random variables and enables a 4-fold (resp. 2-fold) 
acceleration of the pre-processing of [KS91] (resp. [KP91]). 

We need more than two pages besides the space for definitions in order to prove that result, but 
it enables us to improve substantially the popular and decades-old recipes for pre-processing GENP 
for symbolic computations. Namely, pre-processing of [KS91] requires pre- and post-multiplication 
of an n x n input matrix A by an upper and a lower triangular Toeplitz matrices, respectively, at 
the overall cost dominated by the cost of performing twelve DFT(n) per row of an input matrix A 
(see Remark 5.2), and in addition one must generate 2n — 1 random variables. Pre-processing of 
[KP91] uses as many random variables and six DFT(n) per row of A. Our present algorithm only 
post- or pre-multiplies a matrix A by a single circulant matrix, at the cost dominated by the cost of 
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performing three DFT(n) per row of an input matrix A, and uses only n random variables. 

Let us supply the details. 

Theorem 7.1. Suppose A = is a nonsingular matrix, T = (tj_j + i)" - =1 is a Gaussian 

f-circulant matrix, B = AT = f is a fixed complex number, ti,... ,t n are variables, and 

tk = ft n +k for k = 0, —1,..., 1 — n. Let Biy denotes the l-th leading blocks of the matrix B for 
l = 1,... ,n, and so det are polynomial in t\,... ,t n , for all l, l = 1 ,... ,n. Then neither of 
these polynomials vanishes identically in t \,..., t n . 

Proof. Fix a positive integer l < n. With the convention a k ± n = f&k, for k = 1,..., n, we can write 

n n n 

B u = ^ ^ ' a kl t kl , ^ ' ak 2 +itk 2 , • • •, ^ ' cifej+z—) (7.1) 

fci = l k2=l ki=l 

where ctj is the jth column of A; >n . Let ai t j+ n = .fai,j, for k = 1,..., n, and readily verify that 

n 

bij = 'y ' aij+ k —\t k , 

fe =1 

and so det(B;) is a homogeneous polynomial in ti,... ,t n . 

Now Theorem 7.1 is implied by the following lemma. 

Lemma 7.1. If det {Biy) = 0 identically in all the variables t\,... ,t n , then 

det(a il , a i2 ,..., a it ) = 0 (7.2) 

for all l-tuples of subscripts {i \,..., if) such that 1 < i\ < ii < • •• <ii < n. 

Indeed let A kn denote the block submatrix made up of the first l rows of A. Notice that if 
(7.2) holds for all /-tuples of the subscripts (i±,... ,ii) above, then the rows of the block submatrix 
Ai t . n are linearly dependent, but they are the rows of the matrix A, and their linearly dependence 
contradicts the assumption that the matrix A is nonsingular. 

In the rest of this section we prove Lemma 7.1. At first we order the /-tuples I = (ii,... ,if), 
each made up of / positive integers written in nondecreasing order, and then we apply induction. 

We order all /-tuples of integers by ordering at first their largest integers, in the case of ties by 
ordering their second largest integers, and so on. 

We can define the classes of these /-tuples up to permutation of their integers and congruence 
modulo n, and then represent every class by the /-tuple of nondecreasing integers between 1 and n. 
Then our ordering of /-tuples of ordered integers takes the following form, (ii, ■ ■ ■ ,ii) < (i[,..., i[) 
if and only if there exist a subscript j such that ij < i'j and i k = i' k for k = j + 1,..., /. 

We begin our proof of Lemma 7.1 with the following basic result. 

Lemma 7.2. It holds that 

i 

d et {Bi,i)= Y, "IF IK 

j = 1 

where a tuple (*i ,... ,ii) may contain repeated elements, 

^ ^ det( q^,...,g:^), (7.3) 

(*i> •••>*;) 

and (ii,..., i[) ranges over all permutations of (ii ,..., i{). 
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Proof. By using (7.1) we can expand det (Bu) as follows, 


n n n 

det (Bij) = det (E 5 ^ ^ ^^2 + 1^2 1 • • • 1 ^ ^ ^ki~\~l — l^ki^ 

ki — 1 /c2 — 1 ki = 1 

n n n 

= 'y ' ti x det ^ ' cxk+itk , • ■ •, ^ ' otkL+i—itki'j 

ii=l k— 1 fc;=l 

n n n n 

= E^E tj 2 det ^d^ , C^ 2 , 2 pl, ^ ^ dfc 2 _j_ 2 tfc 2 , * • * ; ^ ^ Qki+l — Pki^ 


1 = 1 i 2 =l 


fc 2 = l 


fe;=l 


^ ' f'ii ^ ^ t ? ; 2 • • • ^ ( t?: ; det (di x , dj 2 _|_i,..., d^-f-z —i ) • (1 -4) 

*1 = 1 i 2 —1 U =1 

Consequently the coefficient aj-ji__ i ti of any term JI* =i is the sum of all determinants 

det (d^, d_|_i,..., d.j' p/—i) 

where (i^,..., i() ranges over all permutations of (ii,..., if), and we arrive at (7.3). 

□ 

In particular, the coefficient of the term t\ is .^ = det(di, d 2 , ■ ■ ■, otf). This coefficient 

equals zero because Bii is identically zero, by assumption of lemma 7.1, and we obtain 

det(di, d 2 , • ■ •, on) = 0. (7-5) 

This is the basis of our inductive proof of Lemma 7.1. In order to complete the induction step, it 
remains to prove the following lemma. 

Lemma 7.3. Let J = (*i,..., ii) be a tuple such that 1 < ii < i 2 < ■ ■ ■ < k < n. 

Then J is a subscript tuple of the coefficient of the term rij=i Lj-j+i in equation (7.3). 
Moreover, J is the single largest tuple among all subscript tuples. 

Proof. Hereafter det(dj' i , c^+i,..., di' + ;_i) is said to be the determinant associated with the permu¬ 
tation {)(,... ,i[) of (*i,..., ii) in (7.3). Observe that det ( a ,..., dj,) is the determinant associated 
with X = (ii, *2 — 1,..., ii — l + 1) in the coefficient a n i 

llj = l ij 3 + 1 

Let X’ be a permutation of X. Then X’ can be written as X’ = ( i si — S! + 1, i S2 — s 2 + 1, • • •, i Si ~ 
Si + 1), where (si,..., si) is a permutation of (1,..., l). The determinant associated with X’ has the 
subscript tuple J' = (i si — Si + 1, i S2 — «2+2, • • -, i si — Si+l). j satisfies the inequality j < ij < n—l+j 
because by assumption 1 <ii < ii < ■ ■ ■ < ii < n, for any j = 1,2,... ,1. Thus, i Sj — Sj + j satishes 
the inequality j < i Sj — Sj + j < n — l + j < n, for any Sj. This fact implies that no subscript of X' 
is negative or greater than n. 

Let J" = (i Sri — s ri + r\,i Sr2 — s r2 + 7 - 2 ,..., i Sri ~ s n + n) be a permutation of J such that 
its elements are arranged in the nondecreasing order. Now suppose J" > J. Then we must have 
is r , — s n + ri > ii- This implies that 


H - is n < U - s rr (7.6) 

Observe that 

l - s r , < ii — i Sri (7.7) 

because *i < *2 < ■ ■ ■ < ii by assumption. Combine bounds (7.6) and (7.7) and obtain that 

l — s n < ii — i Sri < ri — s n and hence r; = l. 

Apply this argument recursively for l — 1,..., 1 and obtain that r :i = j for any j = 1,...,/. 

Therefore J = J' and X' = X. It follows that J is indeed the single largest subscript tuple. □ 
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By combining Lemmas 7.2 and 7.3, we support the induction step of the proof of Lemma 7.1, 
which we summarize as follows: 

Lemma 7.4. Assume the class of l-tuples of l positive integers written in the increasing order in 
each l-tuple and write det(I) = det(«i 1 , aq 2 ,..., a*,) if I = (aq , a, 2 , ..., a,,). 

Then det(J) = 0 provided that det(J) = 0 for all J < I. 

Finally we readily deduce Lemma 7.2 by combining this result with equation (7.5). This completes 
the proof of Theorem 7.1. □ 

Corollary 7.1. Assume any nonsingular nx n matrix A and a finite setS of cardinality |<S|. Sample 
the values of the n coordinates v \,..., v n of a vector v at random from this set. Fix a complex f and 
define the matrix H = Zf(v ) of size n x n, with the first column vector v = (uj)” =1 . Then GENP 
and BGE are safe for the matrix AH 

(i) with a probability of at least 1 — 0.5(n — l)n/|<S| if the values of the n coordinates Vi ,..., v n 
of a vector v have been sampled uniformly at random from a finite set S of cardinality |«S| or 

(ii) with probability 1 if these coordinates are i.i.d. Gaussian variables. 

(Hi) The same claims hold for the matrix FA. 

Proof. Theorems 4.1, 7.1, and 2.1 together imply claims (i) and (ii) of the corollary. By applying 
transposition to the matrix AH , extend the results to claim (iii). □ 

7.2 GENP with any one-sided circulant pre-processing fails numerically 
for some specific inputs 

By virtue of Corollary 7.1 random circulant pre-processing is a universal means for ensuring safe 
GENP, but is it also a universal means for ensuring numerically safe GENP? 

By virtue of [P16, Corollary 6.3.1], the answer is “No”, 8 and moreover GENP is numerically 
unsafe when it is applied to the DFT matrix already for a reasonably large integer n as well as to 
the matrices Q, n Z\(v) and any vector v of dimension n, that is, for any circulant matrix Zi(v), and 
consequently for a random circulant matrix Zfiv). By combining the proof of [P16, Corollary 6.3.1] 
with Theorem 5.1 one can immediately extend the result to any /-circulant multiplier Zf(v) for any 
f ^ 0. It follows that GENP also fails numerically for the input pairs (A,H) where A = G. n Q and 
H = Q h Zfiy) for any unitary matrix Q. 

Surely one does not need to use GENP in order to solve a linear system of equations with a 
DFT coefficient matrix, but the above results reveal the difficulty in finding universal classes of 
structured pre-processing for GENP. Having specific bad pairs of inputs and multipliers does not 
contradict claim (ii) of Corollary 4.1, and actually in extensive tests in [PQZ13] and [PQY15] very 
good numerical stability has been observed when we applied GENP to various classes of nonsingular 
well-conditioned input matrices with random circulant multipliers. 


8 Alternatives: augmentation and additive pre-processing 

Other randomization techniques, besides multiplications, are also beneficial for various fundamental 
matrix computations (see [Mil], [HMT11], [PGMQ], [PIMR10], [PQ10], [PQZC], [PQ12], [PZa], and 
the bibliography therein). By combining our present results with those of [PZa], we next supports 
GENP and BGE by means of randomized augmentation of a matrix, that is, of appending to it 
random rows and columns, as well as by means of an alternative and closely related technique of 
additive pre-processing (cf. (8.2)-(8.4)). 

8 This nontrivial result has been deduced by using the recent specialization in [P15] to Cauchy and Vandermonde 
matrices of the general techniques of transformation of matrix structure proposed in [P90]. 
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8.1 Gaussian augmentation 


By virtue of claim (i) of [PZa, Theorem 10.1], properly scaled Gaussian augmentation of a sufficiently 
large size is likely to produce strongly nonsingular and strongly well-conditioned matrices. Namely, 

this holds when we augment an n x n matrix A and produce the matrix K = . Here U 


and V are n x h Gaussian matrices filled with 2 hn i.i.d. Gaussian random entries and v < h < n, for 
v denoting an upper bound on the numerical nullity of the leading blocks, that is, on their numerical 
co-rank. In the dual version of that theorem, average matrix K is strongly nonsingular and strongly 
well-conditioned if A is a Gaussian matrix and if the matrices U and V have full rank and are scaled 
so that ||[/|| « ||V|| ~ 1, are well-conditioned. 


8.2 SRFT augmentation 

By virtue of [PZa, Section 8] we are likely to succeed if we apply GENP to the matrix K above 
obtained by augmenting a nonsingular and well-conditioned matrix A with SRFT matrices U and 
V of a sufficiently large size replacing the Gaussian ones of the previous subsection. 9 

Let us supply some details. Claim (ii) of [PZa, Theorem 10.1] implies that we are likely to produce 
a strongly nonsingular and strongly well-conditioned matrix K if U and V are SRFT matrices such 
that v > q = cn for a sufficiently large constant c and if 

4 (V"+ \/81og 2 (i/n)) log 2 {v)<h. (8.1) 

Indeed, the qxq leading blocks K qq of the matrix K are the identity matrices I q for q = 1,..., h, 
and so we only need to estimate the probability that the leading blocks K q q are well-conditioned 
for all q > h because their nonsingularity with probability 1 readily follows from Theorem 2.1. 

It is proven in [PZa, Section 8] that under (8.1) this property holds with a probability at least 
1 — c' / q for a constant d and a fixed q. Therefore it holds for all q , q = h + 1,..., h + n with a 
probability at least 1 — d J2 q =h+ i V? ~ 1 — d ln(^±y) = 1 — d ln(l + fipy) > 1 — d ln(l + 1/c). This 
is close to 1 for a sufficiently large constant c, and our claim about the matrix K follows. 


8.3 Linking augmentation to additive pre-processing 

Consider an augmented matrix K and its inverse AT -1 . Then its n x n trailing (that is, southeastern) 
block is C -1 for C = A — UV T . Indeed 


T - _ ( hi Oh,n\ f Ih Oh,n\ f Ih L T \ 

V ~ \U In J \O n>h C ) \O n , h In)' 

Consequently 

K ~ 1 = (o h T) (o c-i)( Ih u 

\Un,h In ) \<J n ,h C ) \~U l n ) 

C -1 = diag (Oh’hln)!^- 1 diag(0; l j,/ n ), and the claim follows. 

Now deduce that ||A' _1 ||/7V < ||C ,_1 || < ||/\ _1 ||, ||/\||/A < ||C'~ 1 || < N ||A'||, and hence 

k{K)/N 2 < k(C) < Nk(K), for N = (1 + |]H||)(1 + ||H||). 


( 8 . 2 ) 

(8.3) 


(8.4) 


Equations (8.2)-(8.4) closely link augmentation A K with additive pre-processing A —» C and also 
link the leading blocks of the augmented matrices with those output by additive pre-processing. 

Indeed readily extend our observations to obtain that the k x k trailing submatrix of the leading 
block h+k of the matrix K is the k x k leading block C k \ of the matrix C -1 and that 

ft(A h+k,h+k)/H — K(C ktk ) ^ N&{,IAh+k,h+k)i (8.5) 


for k = 1,..., n. If the factor N is reasonably bounded, which is likely for Gaussian matrices U and 
V, then the matrix Ck,k is nonsingular and well-conditioned if and only if so is the matrix Kh+k,h+k- 

9 The same results and estimates hold if one substitutes matrices of SRHT for the ones of SRFT (cf. [Til]). 
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8.4 Transition back to computations with the original matrix. Expansion 
and homotopy continuation 

Having computed the inverses A' -1 and C~ l by applying GENP or BGE to the augmented matrix 
K , one can simplify computation of the inverse A -1 of the original matrix A by applying the 
Sherman-Morrison-Woodbury formula 10 

A- 1 = C~ x - C~ 1 U(I h + V T C- 1 U)- 1 V T C~ 1 1 (8.6) 

for C = A — UV T (cf. [GL13, page 65]). 

Computing the inverse A -1 by means of SMW formula (8.6) may still cause numerical problems 
at the stages of computing and inverting the matrix Ih + V T C~ l U , but they are less likely to occur if 
the matrix C is nonsingular and well-conditioned, which we expect to be the case in this application. 

In order to strengthen the chances for the success of this approach we can apply some heuristic 
recipes. In our tests with benchmark inputs, we succeeded by simply doubling the lower bound 
v on the dimension h of additive pre-processing or equivalently by keeping the same bound v but 
requiring that 2 u < h < n. Another natural remedy is the well-known general technique of homotopy 
continuation, with which one proceeds as follows. Fix two matrices U and V as before and define 
the matrices C(t) = A — tUV t and S(t) = I h + tV t C(t)~ 1 U for a nonnegative parameter r. 
Suppose that the matrix 5(f) is diagonally dominant for some positive f. Notice that the matrices 
(7(1) = A — UV T and 5(0) = Ih are readily invertible, fix the decreasing sequence of the values 7>, 
k = 0,1such that tq = 1 > t± > • • ■ > r; = 0, and compute the sequence of the matrices 
TjV T C(Tj)~ 1 U , for j = 0,1,...,/, by extending SMW formula (8.6) as follows, 

C(r j+ r)- 1 = Cirj)- 1 - A j C(T j )~ 1 U(I h + A j V T C(T j )- 1 U)~ 1 V T C(r j )-\ 

for A j = Tj + \—Tj. For sufficiently small values A j, the matrices Ih + AjV T C(Tj)~ 1 U are diagonally 
dominant and readily invertible, and then we can numerically safely perform l homotopy continuation 
steps for the transition from the inverse (7(1) _1 = (7 -1 to C'(O) -1 = A -1 . 

By generalizing SMW formula (8.6) and writing C = A — UV T , we can readily express the inverse 
A' -1 of the augmented matrix K, 

- ( 0 4 - V X) (% °r ). * - (i V I) - § °r) v d 

\<Jn,h ^ J \—U 1 n J \U A J yU l n J \Un,h ^ J 

9 Numerical Experiments 

Numerical experiments have been performed, by using MATLAB, by the second author in the 
Graduate Center of the City University of New York on a Dell computer with the Intel Core 2 
2.50 GHz processor and 4G memory running Windows 7. Gaussian matrices have been generated 
by applying the standard normal distribution function randn of MATLAB. We refer the reader to 
[PQZ13], [PQY15], and [PZa] for other extensive tests of GENP with randomized pre-processing. 

Tables 9.1 9.4 show the maximum, minimum and average relative residual norms 11 Ay — b||/| |b|| 
as well as the standard deviation for the solution of 1000 linear system Ax = b with Gaussian vector 
b and n x n input matrix A for each n, n = 256, 512,1024 and 10 linear systems for n = 2048,4096, 

A =(o c)’ f 9 - 1 * 

with k x k blocks A*,, B, C and D , for k = n/2, scaled so that ||13|| « ||(7|| ~ ||D|| « 1, the k — 4 
singular values of the matrix A*, were equal 1 and the other ones were set to 0 (cf. [H02, Section 
28.3]), and with Gaussian Toeplitz matrices B, (7, and D , that is, with Toeplitz matrices of (5.1), 
each defined by the i.i.d. Gaussian entries of its first row and first column. (The norm 11 A -1 11 ranged 
from 2.2 x 10 1 to 3.8 x 10 6 in these tests.) The linear systems have been solved by using GEPP, 

10 Hereafter we use the acronym SMW. 
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GENP, or GENP pre-processed with real Gaussian, real Gaussian circulant, and random circulant 
multipliers, each followed by a single loop of iterative refinement. 

In the tests covered in Table 9.4, the matrix A was set to equal Q, the matrix of DFT(n). For 
pre-processing, either Gaussian or Gaussian unitary circulant matrices C = Q~ 1 D(fl'v)Cl have been 
used as multipliers, with v = (u,)^T 0 , Vi = exp(27r<^j-\/ — 1 /rz) and n i.i.d. real Gaussian variables <j>i, 
i = 0,.. ., n — 1 (cf. Theorem 5.1 and Remark 5.5). 

As should be expected, GEPP has always produced accurate solutions, with the average relative 
residual norms ranging from 10~ 12 to 7 x 10 -13 , but GENP with no pre-processing has consistently 
produced corrupted output with relative residual norms ranging from 1(R 3 to 10 2 for the input 
matrices A of equation (9.1). Even much worse was the output accuracy when GENP with no 
pre-processing or with Gaussian circulant pre-processing was applied to the matrix A = 17. In 
all other cases, however, GENP with random circulant pre-processing and with a single loop of 
iterative refinement has produced solution with desired accuracy, matching the output accuracy 
of GEPP. Furthermore GENP has performed similarly when it was applied to a nonsingular and 
well-conditioned input pre-processed with a Gaussian multiplier. 


Table 9.1: Relative residual norms: GENP with Gaussian multipliers 


dimension 

iterations 

mean 

max 

min 

std 

256 

0 

6.13 x 10~ y 

3.39 x 10" B 

2.47 x 10 -12 

1.15 x 10~ 7 

256 

1 

3.64 x 10" 14 

4.32 x 10~ 12 

1.91 x 10 _u> 

2.17 x 10~ i3 

512 

0 

5.57 x 10“ 3 

1.44 x 10~ 5 

1.29 x 10 -11 

7.59 x 10~ Y 

512 

1 

7.36 x 10 _i3 

1.92 x 10" iU 

3.32 x 10 -i5 

1.07 x 10- 11 

1024 

0 

2.58 x 10” 7 

2.17 x 10~ 4 

4.66 x 10 -11 

6.86 x 10" 6 

1024 

1 

7.53 x 10 _i2 

7.31 x 10~ y 

6.75 x 10“ i5 

2.31 x 10~ iU 

2048 

0 

4.14 x 10~ y 

8.16 x 10 -y 

8.27 x 10 _1U 

3.72 x 10~ 9 

2048 

1 

7.61 x 10" i2 

1.08 x 10" 11 

3.27 x 10“ i2 

3.89 x 10“ 12 

4096 

0 

5.02 x 10" 7 

1.23 x 10" e 

1.14 x 10" 7 

6.29 x 10" 7 

4096 

1 

5.44 x 10 _ii 

1.53 x 10~ iU 

2.64 x 10 -12 

8.52 x 10 _ii 


Table 9.2: Relative residual norms: GENP with Gaussian circulant multipliers 


dimension 

iterations 

mean 

max 

min 

std 

256 

0 

8.97 x 10 _ii 

1.19 x 10~ s 

6.23 x 10" 13 

4.85 x lO^ 10 

256 

1 

2.88 x 10 -14 

2.89 x 10 -12 

1.89 x 10 -15 

1.32 x 10- 13 

512 

0 

4.12 x 10 _1U 

3.85 x 10~ 3 

2.37 x 10 -12 

2.27 x 10” 9 

512 

1 

5.24 x 10 -14 

5.12 x 10" 12 

2.95 x 10 -15 

2.32 x 10 -13 

1024 

0 

1.03 x 10” 8 

5.80 x 10~ e 

1.09 x lO' 11 

1.93 x lO’ 7 

1024 

1 

1.46 x 10 -13 

4.80 x RT 11 

6.94 x 10 -15 

1.60 x 10~ 12 

2048 

0 

1.03 x 10“ 8 

2.87 x 10" s 

1.13 x 10" 9 

1.59 x 10” 8 

2048 

1 

3.74 x 10 -13 

6.09 x 10" 13 

9.69 x 10 -14 

2.59 x 10 -13 

4096 

0 

2.46 x 10 _y 

4.17 x 10" s 

1.93 x 10 _1U 

2.05 x 10’ 9 

4096 

1 

7.82 x 10 -13 

1.35 x 10 -12 

2.02 x 10 -13 

5.72 x lO^ 13 


Next we cover our tests of GENP pre-processed with some multipliers defined by means of 
combining matrices of Section 6. The test results are represented in Tables 9.5 and 9.6. 

In this series of our tests we set n = 128 and applied GENP to matrices of (9.1) and six families of 
benchmark matrices from [BDHT13], pre-processed with multipliers combining the ones of following 
three basic families. 

Family 1: The matrices APF of depth 3 (with d = 3) and with a (single) random permutation. 
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Table 9.3: Relative residual norms: GENP with circulant multipliers filled with ±1 


dimension 

iterations 

mean 

max 

min 

std 

256 

0 

2.37 x 10"^ 

2.47 x 10" iU 

9.41 x 10 -i4 

1.06 x 10- 11 

256 

1 

2.88 x 10" 14 

3.18 x 10~ L2 

1.83 x 10" ib 

1.36 x lO^ 13 

512 

0 

7.42 x 10 -12 

6.77 x 10" 1U 

3.35 x lO" 13 

3.04 x nr 11 

512 

1 

5.22 x 10 -i4 

4.97 x 10~ i2 

3.19 x 10~ i5 

2.29 x 10 -13 

1024 

0 

4.43 x 10 -11 

1.31 x 10~ s 

1.28 x 10~ 12 

4.36 x 10 _1U 

1024 

1 

1.37 x 10 -ia 

4.33 x 10- 11 

6.67 x 10 -i5 

1.41 x 10^ 

2048 

0 

5.42 x 10 -y 

1.59 x 10~ s 

1.54 x 10 -10 

9.04 x 10~ 9 

2048 

1 

1.17 x 10" i3 

2.40 x 10" 13 

5.14 x 10“ i4 

1.07 x 10“ 13 

4096 

0 

1.22 x 10” s 

2.47 x 10" 8 

6.41 x 10 -1U 

1.21 x 10” 8 

4096 

1 

2.29 x 10 -13 

4.36 x 10" 13 

1.05 x IQ" 13 

1.81 x IQ" 13 


Table 9.4: Relative residual norms: GENP for DFT(n) with Gaussian multipliers 


dimension 

iterations 

mean 

max 

min 

std 

256 

0 

2.26 x 10 -12 

4.23 x 10" 11 

2.83 x 10 -13 

4.92 x 10 -1 ^ 

256 

1 

1.05 x 10 -ib 

1.26 x 10- 15 

9.14 x 10 _1B 

6.76 x 10- iY 

512 

0 

1.11 x 10 -11 

6.23 x lO^ 10 

6.72 x 10" 13 

6.22 x nr 11 

512 

1 

1.50 x 10 -15 

1.69 x lO^ 15 

1.33 x 10 -15 

6.82 x 10- 17 

1024 

0 

7.57 x 10 _iU 

7.25 x 10" 8 

1.89 x 10 -12 

7.25 x 10” 9 

1024 

1 

2.13 x 10“ ib 

2.29 x 10^ ib 

1.96 x 10“ i5 

7.15 x 10^ iY 

2048 

0 

2.11 x 10 -11 

3.05 x 10- 11 

1.64 x lO' 11 

8.08 x 10 -12 

2048 

1 

1.47 x 10 _i3 

2.73 x lO" 13 

8.10 x 10 _i4 

1.09 x lO" 13 

4096 

0 

1.36 x 10 _1U 

3.01 x 10" 1U 

4.52 x lO" 11 

1.43 x 10 _1U 

4096 

1 

6.12 x 10" ia 

9.69 x 10- 13 

1.91 x 10" 13 

3.93 x 10- ia 


Family 2: Sparse circulant matrices C = where the vector v has been filled with 

zeros, except for its ten coordinates filled with ±1. Here and hereafter each sign + or — has been 
assigned with probability 1/2. 

Family 3: Sum of two inverse bidiagonal matrices. At first their main diagonals have been filled 
with the integer 101, and their first subdiagonals have been filled with ±1. Then each matrix have 
been multiplied by a diagonal matrix diag(±2 &i ), where 6,; were random integers uniformly chosen 
from 0 to 3. 

We combined these three basic families of multipliers and tested GENP on their ten combinations, 
listed below. For each combination we have performed 1000 tests and have recorded the average 
relative error 11 Ax — b| |/| |b| | with matrices A from the seven benchmark families and vectors b being 
standard Gaussian vectors. Here are these ten combinations. 

1. F = I, H is a matrix of Family 1. 

2. F = /, H is a matrix of Family 3. 

3. F = H is a matrix of Family 1. 

4. F = H is a matrix of Family 3. 

5. F is a matrix of Family 1, if is a matrix of Family 3. 

6. F = I, H is the product of two matrices of Family 1. 

7. F = i, H is the product of two matrices of Family 2. 

8. F = I, H is the product of two matrices of Family 3. 

9. F = I, H is the sum of two matrices of Families 1 and 3. 

10. F = I, H is the sum of two matrices of Families 2 and 3. 

We tested these multipliers for the same linear systems as in our previous tests in this section 
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and for six classes generated from Matlab, by following their complete description in Matlab and 
[BDHT13]. Here is the list of these seven test classes. 

1. The matrices A of (9.1). 

2. ’circul’: circulant matrices whose first row is a standard Gaussian random vector. 

3. ’condex’: counter-examples to matrix condition number estimators. 

4. ’fiedler’: symmetric matrices generated with (i,j) and (j,i) elements equal to c* — Cj where 
ci,... ,c n are i.i.d. standard Gaussian variables. 

5. ’orthog’: orthogonal matrices with (i,j) elements sin^j. 

6. ’randcorr’: random n x n correlation matrices with random eigenvalues from a uniform 
distribution. (A correlation matrix is a symmetric positive semidefinite matrix with ones on the 
diagonal.) 

7. ’toeppd’: n x n symmetric, positive semi-definite (SPSD) Toeplitz matrices T equal to the 
sums of m rank-2 SPSD Toeplitz matrices. Specifically, 

T = w(l) * T(0( 1)) + ... + w{m) * T(9(m)) 

where 9{k) are i.i.d. Gaussian variables and T(9(k)) = (cos(27r(i — j)0(fe)))" -_i- 

In our tests, for some pairs of inputs and multipliers, GENP has produced no meaningful output. 
In such cases we filled the respective entries of Tables 9.5 and 9.6 with oo. 

GENP pre-processed with our multipliers of the 9th combination of three basic families, has 
produced accurate outputs without iterative refinement for all seven benchmark classes of input 
matrices. With the other combinations of the three basic families of our multipliers, this was 
achieved from 4 to 6 (out of 7) benchmark input classes. For comparison, the 2-sided pre-processing 
with PRBT-based multipliers of [BDHT13] and [BBBDD14] always required iterative refinement. 


Table 9.5: Relative residual norms output by pre-processed GENP with no refinement iterations 


class 

1 

2 

3 

4 

5 

6 

7 

1 

2.61e-13 

6.09e-15 

OO 

2.62e+02 

7.35e-15 

1.38e-12 

3.04e-13 

2 

2.02e+02 

4.34e-14 

5.34e-16 

OO 

7.35e+02 

5.27e-15 

3.23e-15 

3 

4.34e-13 

8.36e-15 

OO 

3.03e+02 

1.94e-14 

3.04e-13 

3.21e-13 

4 

1.48e+01 

1.36e-12 

2.39e-16 

1.01e-ll 

4.71e+01 

5.09e-15 

5.12e-15 

5 

3.71e-ll 

2.21e-14 

OO 

2.85e+01 

5.83e-10 

2.23e-12 

1.34e-12 

6 

3.33e-13 

9.36e-15 

oo 

3.66e-05 

7.04e-15 

3.75e-13 

2.11e-13 

7 

7.76e-12 

3.55e-14 

9.91e+01 

7.90e+00 

7.75e+00 

7.11e+00 

1.05e+01 

8 

7.95e+00 

9.55e-14 

7.56e-16 

OO 

5.74e+03 

6.51e-15 

3.57e-15 

9 

5.36e-13 

1.51e-14 

4.26e-16 

2.24e-ll 

3.68e-13 

6.47e-15 

4.92e-15 

10 

3.50e-12 

8.43e-14 

3.43e-13 

2.90e-10 

1.36e+01 

3.53e-13 

1.67e-13 


Finally we present the results of our tests of GENP with additive pre-processing applied to the 
same nxn test matrices A of (9.1), but for n = 128, 256, 512,1024. In this case we applied GENP to 
the matrix C = A — UV T where U and V were n x h random Gaussian subcirculant matrices, each 
defined by the n i.i.d. Gaussian entries of its first column and scaled so that ||A|| = 2||[/R T ||. Then 
we computed the solution x to the linear system Ax = b for a Gaussian vector b by substituting 
the SMW formula (8.6) into the equation x = A~ 1 b. 

We present the test results in Table 9.7. The statistics were taken over 100 runs for each n. The 
results changed little when we scaled the matrices U and V to increase the ratio ||A||/||[/R T || to 10 
and 100. 
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Table 9.6: Relative residual norms output by pre-processed GENP followed by a single refinement 
iteration 


class 

1 

2 

3 

4 

5 

6 

7 

1 

1.13e-15 

6.90e-17 

OO 

1.12e+00 

5.23e-17 

2.10e-16 

1.05e-15 

2 

5.07e-04 

7.71e-17 

1.03e-16 

OO 

4.40e+02 

1.99e-16 

1.19e-15 

3 

1.14e-15 

7.34e-17 

OO 

5.43e-13 

5.15e-17 

2.24e-16 

1.10e-15 

4 

1.55e-03 

6.19e-17 

1.31e-16 

5.69e-13 

2.69e+02 

2.13e-16 

1.17e-15 

5 

9.80e-16 

6.96e-17 

OO 

6.75e+01 

5.35e-17 

2.47e-16 

9.84e-16 

6 

1.08e-15 

6.13e-17 

OO 

6.35e-13 

5.08e-17 

1.86e-16 

1.05e-15 

7 

3.47e+01 

6.17e-17 

2.61e+06 

5.21e+00 

5.31e-17 

1.97e-16 

9.97e-16 

8 

2.56e-04 

6.67e-17 

1.15e-16 

OO 

7.96e+02 

1.98e-16 

1.08e-15 

9 

9.81e-16 

7.44e-17 

3.99e-17 

6.40e-13 

5.09e-17 

2.02e-16 

1.15e-15 

10 

9.79e-16 

8.32e-17 

1.14e-16 

7.34e-13 

4.07e+01 

2.23e-16 

1.04e-15 


Table 9.7: Relative residual norms of GENP with Gaussian subcirculant additive pre-processing 


n 

h 

iterations 

mean 

max 

min 

std 

128 

4 

0 

1.47e-10 

2.51e-09 

1.05e-12 

3.85e-10 

128 

4 

1 

1.58e-14 

3.39e-13 

1.26e-15 

4.48e-14 

256 

4 

0 

8.14e-10 

2.87e-08 

1.20e-ll 

3.02e-09 

256 

4 

1 

3.57e-14 

1.16e-12 

2.52e-15 

1.24e-13 

512 

4 

0 

8.86e-09 

3.03e-07 

4.42e-ll 

3.44e-08 

512 

4 

1 

2.16e-13 

1.35e-ll 

4.60e-15 

1.36e-12 

1024 

4 

0 

2.12e-08 

3.06e-07 

1.45e-10 

4.98e-08 

1024 

4 

1 

9.87e-14 

1.95e-12 

6.64e-15 

2.38e-13 


10 Conclusions 

Gaussian elimination with partial pivoting is the workhorse for modern matrix computations, but it 
is significantly slowed down by communication intensive pivoting, both for inputs of small and large 
sizes. Pre-processing with random and fixed multipliers as well as by means of augmentation are 
efficient alternatives to pivoting according to the results of extensive tests in this paper and a number 
of previous papers. Our novel insight provides formal support for these empirical observations and 
embolden widening the search area for efficient pre-processors. We present our initial findings in our 
search for new classes of such pre-processors and confirm their efficiency with our numerical tests. 

In [PZ16] and [PZa] our techniques yield similar results for the fundamental problem of low-rank 
approximation of a matrix and for the approximation of two singular spaces of a matrix associated 
with the two sets of its largest and smallest singular values separated by a gap, respectively. 

For the latter subject our substantial new research progress is under way. We also plan to extend 
our present results to the computation of Rank Revealing LU Factorization of [POO]. 

Acknowledgements: We thank a reviewer for valuable comments and acknowledge support by 
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